library(tidyverse)
library(bayesrules)
library(bayesplot)
library(rstan)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(forcats)
library(pROC)
options(mc.cores = parallel::detectCores())
coffee_data <- coffee_ratings %>%
select(species, flavor) %>%
mutate( species=(species=="Robusta") )
sample_n( coffee_data, 10 )
| species | flavor |
|---|---|
| <lgl> | <dbl> |
| FALSE | 7.58 |
| FALSE | 7.08 |
| FALSE | 7.75 |
| FALSE | 7.58 |
| FALSE | 6.83 |
| FALSE | 7.83 |
| FALSE | 7.67 |
| FALSE | 7.42 |
| FALSE | 7.50 |
| FALSE | 8.08 |
options(repr.plot.width=15, repr.plot.height=5)
ggplot( coffee_data, aes(x=flavor, fill=species) ) + geom_density( alpha=0.5 )
Fit a logistic regression model with
$$Y_i | \beta_0, \beta_1 \sim \text{Bern}(\pi_i),\quad \text{with } \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_i,$$
$$\beta_{0,c} \sim N(m_0, s_0^2),$$
$$\beta_{1} \sim N(m_1, s_1^2)$$
ggplot( trees, aes(x=Height, y=Girth) ) + geom_point() + geom_smooth( method="lm" )
`geom_smooth()` using formula = 'y ~ x'
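The logistic model for the coffee data could be fit with rstanarm along these lines (a sketch; the priors shown are illustrative weakly informative choices, not values prescribed by the exercise):

```r
# Logistic regression of species (Robusta vs. not) on flavor (sketch;
# priors are illustrative weakly informative choices)
coffee_model <- stan_glm(
    species ~ flavor,
    data = coffee_data, family = binomial,
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior = normal(0, 2.5, autoscale = TRUE),
    chains = 4, iter = 5000*2, seed = 84735
)
```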
Fit a linear regression model with
$$Y_i | \beta_0, \beta_1, \sigma \sim N(\mu_i, \sigma^2),\quad \text{with } \mu_i = \beta_0 + \beta_1 X_i,$$
$$\beta_{0,c} \sim N(m_0, s_0^2),$$
$$\beta_{1} \sim N(m_1, s_1^2)$$
$$\sigma \sim \text{Exp}(l)$$
head( radon )
| | floor | county | log_radon | log_uranium |
|---|---|---|---|---|
| | <int> | <fct> | <dbl> | <dbl> |
| 1 | 1 | AITKIN | 0.83290912 | -0.6890476 |
| 2 | 0 | AITKIN | 0.83290912 | -0.6890476 |
| 3 | 0 | AITKIN | 1.09861229 | -0.6890476 |
| 4 | 0 | AITKIN | 0.09531018 | -0.6890476 |
| 5 | 0 | ANOKA | 1.16315081 | -0.8473129 |
| 6 | 0 | ANOKA | 0.95551145 | -0.8473129 |
Uranium levels are given at the county level, radon values at the house level. Each county thus has a specific distribution of log_radon that may depend on log_uranium. If the county is known, the uranium level is redundant, since the county already determines the distribution of radon levels; if the county is not known, however, the uranium level can help to predict the radon level. This is a somewhat special situation, but it can nevertheless be encoded as a hierarchical model with random intercepts, where the intercepts absorb the contribution of the uranium level to the mean log_radon concentration.
Distribution of log radon values per county:
options(repr.plot.width=15, repr.plot.height=10)
ggplot( radon, aes(x=log_radon, y=fct_reorder(county, log_radon, .fun='mean') ) ) + geom_boxplot()
Relationship between log uranium concentration and mean log radon concentration per county:
options(repr.plot.width=15, repr.plot.height=5)
radon %>%
group_by( county ) %>%
summarize( log_uranium=mean(log_uranium), mean_log_radon=mean(log_radon) ) %>%
ggplot( aes(x=log_uranium, y=mean_log_radon) ) +
geom_point() +
geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'
There is a roughly linear relationship between the county-wide mean log radon concentration and the county-wide log uranium concentration.
Variant of a normal hierarchical model with random intercepts:
$$Y_{ij} | \beta_{0j}, \beta_1, \sigma_y \sim N(\mu_j, \sigma_y^2), \quad \text{with } \mu_{j} = \beta_{0j} + \beta_1 X_{j}$$
$$\beta_{0j}|\beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2)$$
$$\beta_{0c}\sim N(m_0, s_0^2)$$
$$\beta_{1}\sim N(m_1, s_1^2)$$
$$\sigma_y \sim \text{Exp}(l_y)$$
$$\sigma_0 \sim \text{Exp}(l_0)$$
Here $Y_{ij}$ is the $i$-th individual log radon value in county $j$ and $X_j$ the county-level log uranium value of county $j$. Notice the special form of the model, with $X_j$ and $\mu_j$ depending only on the county. In the next chapter of the book you will learn that $X_j$ is called a group-level predictor.
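This model could be fit with rstanarm along the following lines (a sketch; the prior parameters are left at illustrative autoscaled defaults rather than the $m_0, s_0, \dots$ of the specification):

```r
# Hierarchical normal model with random intercepts per county (sketch;
# priors are illustrative weakly informative defaults, not tuned values)
radon_model <- stan_glmer(
    log_radon ~ log_uranium + (1 | county),
    data = radon, family = gaussian,
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(1, autoscale = TRUE),
    chains = 4, iter = 5000*2, seed = 84735
)
```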
head( roaches )
| | y | roach1 | treatment | senior | exposure2 |
|---|---|---|---|---|---|
| | <int> | <dbl> | <int> | <int> | <dbl> |
| 1 | 153 | 308.00 | 1 | 0 | 0.800000 |
| 2 | 127 | 331.25 | 1 | 0 | 0.600000 |
| 3 | 7 | 1.67 | 1 | 0 | 1.000000 |
| 4 | 7 | 3.00 | 1 | 0 | 1.000000 |
| 5 | 0 | 2.00 | 1 | 0 | 1.142857 |
| 6 | 0 | 0.00 | 1 | 0 | 1.000000 |
dim( roaches )
ggplot( roaches, aes(x=y, fill=as.factor(treatment) ) ) + geom_density( alpha = 0.5 )
Fit a linear regression model with
$$Y_i | \beta_0, \beta_1, \sigma \sim N(\mu_i, \sigma^2),\quad \text{with } \mu_i = \beta_0 + \beta_1 X_i,$$
$$\beta_{0,c} \sim N(m_0, s_0^2),$$
$$\beta_{1} \sim N(m_1, s_1^2)$$
$$\sigma \sim \text{Exp}(l)$$
where $X_i$ is 1 if the apartment has received pest control treatment and 0 otherwise.
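A fit of this model could look as follows (a sketch; the priors shown are illustrative weakly informative choices, not values prescribed by the exercise):

```r
# Linear regression of roach counts on the treatment indicator (sketch;
# priors are illustrative weakly informative choices)
roach_model <- stan_glm(
    y ~ treatment,
    data = roaches, family = gaussian,
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(1, autoscale = TRUE),
    chains = 4, iter = 5000*2, seed = 84735
)
```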
head( book_banning )
| | title | book_id | author | date | year | removed | explicit | antifamily | occult | language | lgbtq | violent | state | political_value_index | median_income | hs_grad_rate | college_grad_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <fct> | <chr> | <date> | <dbl> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <fct> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | House of the Spirits, The | 927 | Allende, Isabel | 2005-04-01 | 2005 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | AK | -13.4 | 15707.5 | 8.738042 | 0.6762701 |
| 2 | It's Not the Stork!: A Book About Girls, Boys, Babies and Bodies | 1024 | Harris, Robie | 2008-02-06 | 2008 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | AK | -13.4 | 15707.5 | 8.738042 | 0.6762701 |
| 3 | King Stork | 1087 | Pyle, Howard | 2008-10-02 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | AK | -13.4 | 15707.5 | 8.738042 | 0.6762701 |
| 4 | How They Met and Other Stories | 936 | Levithan, David | 2008-10-05 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AK | -13.4 | 15707.5 | 8.738042 | 0.6762701 |
| 5 | Ghost in the Shell | 764 | Masamune, Shirow | 2008-10-02 | 2008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AK | -13.4 | 15707.5 | 8.738042 | 0.6762701 |
| 6 | King Stork | 1087 | Pyle, Howard | 2003-09-13 | 2003 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | AK | -13.4 | 15707.5 | 8.738042 | 0.6762701 |
dim( book_banning )
I prefer numeric values for the reason columns:
book_banning <- book_banning %>%
    mutate( across( removed:violent, \(x) as.numeric(x) - 1 ) )
In the US, states differ widely in their political views and corresponding laws. In conservative states, anti-family books might be banned more easily, and in democratic states, discriminatory books might be banned more easily. Examine this with a plot:
book_banning %>%
group_by( state ) %>%
summarize( removal_rate = mean( removed ) ) %>%
ggplot( aes(x=fct_reorder(state, removal_rate), y=removal_rate) ) +
geom_point() +
xlab("State") +
ylab("Mean removal rate")
The average removal rates differ considerably! What about the proportions of the different reasons?
book_banning %>%
select( removed:state ) %>%
filter( removed==1 ) %>%
select( -removed ) %>%
group_by( state ) %>%
summarize_all( sum ) %>%
pivot_longer( explicit:violent, names_to="reason", values_to="count" ) %>%
ggplot( aes(x=state, y=count, fill=reason) ) +
geom_bar( stat="identity" )
$$Y_{ij} | \beta_{0j}, \beta_1, \beta_2, \beta_3 \sim \text{Bern}(\pi_{ij}), \quad \text{with } \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \beta_{0j} + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \beta_3 X_{ij3}$$
$$\beta_{0j}|\beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2)$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_k \sim N(m_k, s_k^2) \quad \text{for } k = 1, 2, 3$$
$$\sigma_0 \sim \text{Exp}(l_0)$$
Here $Y_{ij}$ indicates whether challenged book $i$ in state $j$ was removed or not, and $X_{ij1}$, $X_{ij2}$ and $X_{ij3}$ are dummy variables that encode whether the book was challenged for inappropriate language, anti-family content or violent content.
book_banning %>%
count( state ) %>%
arrange( desc(n) )
| state | n |
|---|---|
| <chr> | <int> |
| PA | 148 |
| OR | 118 |
| CO | 81 |
| CA | 50 |
| IL | 39 |
| MI | 36 |
| VA | 34 |
| OH | 29 |
| FL | 26 |
| NY | 25 |
| OK | 23 |
| IN | 20 |
| NC | 20 |
| MN | 18 |
| IA | 16 |
| SC | 15 |
| ID | 14 |
| GA | 13 |
| KS | 13 |
| KY | 13 |
| TN | 13 |
| VT | 13 |
| AZ | 12 |
| LA | 12 |
| MO | 11 |
| NJ | 11 |
| AL | 10 |
| WI | 10 |
| AK | 9 |
| WA | 9 |
| MA | 8 |
| MT | 8 |
| NH | 7 |
| CT | 6 |
| ND | 6 |
| AR | 5 |
| MD | 5 |
| WY | 4 |
| ME | 3 |
| NM | 3 |
| RI | 3 |
| SD | 3 |
| WV | 3 |
| DE | 2 |
| NE | 2 |
| MS | 1 |
| UT | 1 |
Pennsylvania had the most challenges (148), and Utah and Mississippi the fewest (1 each).
book_banning %>%
group_by( state ) %>%
summarize( removal_rate = mean( removed ), count = n() ) %>%
arrange( desc(removal_rate) )
| state | removal_rate | count |
|---|---|---|
| <chr> | <dbl> | <int> |
| MS | 1.00000000 | 1 |
| ID | 0.85714286 | 14 |
| KS | 0.76923077 | 13 |
| MO | 0.72727273 | 11 |
| NM | 0.66666667 | 3 |
| VA | 0.52941176 | 34 |
| AL | 0.50000000 | 10 |
| DE | 0.50000000 | 2 |
| NY | 0.44000000 | 25 |
| AZ | 0.41666667 | 12 |
| LA | 0.41666667 | 12 |
| SC | 0.40000000 | 15 |
| TN | 0.38461538 | 13 |
| MA | 0.37500000 | 8 |
| OK | 0.34782609 | 23 |
| OH | 0.34482759 | 29 |
| ND | 0.33333333 | 6 |
| RI | 0.33333333 | 3 |
| SD | 0.33333333 | 3 |
| WA | 0.33333333 | 9 |
| CA | 0.32000000 | 50 |
| FL | 0.30769231 | 26 |
| GA | 0.30769231 | 13 |
| IL | 0.30769231 | 39 |
| IN | 0.30000000 | 20 |
| WI | 0.30000000 | 10 |
| NH | 0.28571429 | 7 |
| IA | 0.25000000 | 16 |
| WY | 0.25000000 | 4 |
| NC | 0.20000000 | 20 |
| MI | 0.19444444 | 36 |
| NJ | 0.18181818 | 11 |
| CT | 0.16666667 | 6 |
| KY | 0.15384615 | 13 |
| CO | 0.12345679 | 81 |
| AK | 0.11111111 | 9 |
| PA | 0.07432432 | 148 |
| MN | 0.05555556 | 18 |
| OR | 0.04237288 | 118 |
| AR | 0.00000000 | 5 |
| MD | 0.00000000 | 5 |
| ME | 0.00000000 | 3 |
| MT | 0.00000000 | 8 |
| NE | 0.00000000 | 2 |
| UT | 0.00000000 | 1 |
| VT | 0.00000000 | 13 |
| WV | 0.00000000 | 3 |
Mississippi has the highest removal rate (100%), but only one book was challenged there; Idaho's removal rate of 86% with 14 challenged books is probably more meaningful. Seven states have a removal rate of 0%, Vermont being the most notable with a total of 13 challenged books, none of which were removed.
Three separate plots (they could probably be combined into one):
book_banning %>%
group_by( removed, language ) %>%
summarize( count=n() ) %>%
mutate( proportion=count/sum(count) ) %>%
mutate( removed=as.factor(removed), language=as.factor(language) ) %>%
ggplot( aes(x=removed, y=proportion, fill=language) ) +
geom_bar( stat="identity" )
`summarise()` has grouped output by 'removed'. You can override using the `.groups` argument.
book_banning %>%
group_by( removed, antifamily ) %>%
summarize( count=n() ) %>%
mutate( proportion=count/sum(count) ) %>%
mutate( removed=as.factor(removed), antifamily=as.factor(antifamily) ) %>%
ggplot( aes(x=removed, y=proportion, fill=antifamily) ) +
geom_bar( stat="identity" )
`summarise()` has grouped output by 'removed'. You can override using the `.groups` argument.
book_banning %>%
group_by( removed, violent ) %>%
summarize( count=n() ) %>%
mutate( proportion=count/sum(count) ) %>%
mutate( removed=as.factor(removed), violent=as.factor(violent) ) %>%
ggplot( aes(x=removed, y=proportion, fill=violent) ) +
geom_bar( stat="identity" )
`summarise()` has grouped output by 'removed'. You can override using the `.groups` argument.
All predictors carry only weak information, so our model will probably not explain much variance (grouping by state might help, however).
book_model_1 <- stan_glmer(
removed ~ language + antifamily + violent + (1 | state),
data = book_banning, family = binomial,
prior_intercept = normal(0, 2.5, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
chains = 4, iter = 5000*2, seed = 84735
)
MCMC diagnostics:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(book_model_1, size = 0.1)
mcmc_dens_overlay(book_model_1)
mcmc_acf(book_model_1)
neff_ratio(book_model_1)
rhat(book_model_1)
Warning message: “The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0. ℹ Please use the `rows` argument instead. ℹ The deprecated feature was likely used in the bayesplot package. Please report the issue at <https://github.com/stan-dev/bayesplot/issues/>.”
Looks good! I find it difficult to bring the autocorrelation plots to a good scale.
Posterior median model at 80% level:
tidy(book_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -1.1912906 | 0.1941925 | -1.44911049 | -0.9411182 |
| language | 0.3318395 | 0.1903472 | 0.08802237 | 0.5753550 |
| antifamily | -1.0319176 | 0.4835658 | -1.68526989 | -0.4450214 |
| violent | 0.4734280 | 0.2367321 | 0.17149035 | 0.7712990 |
or in other words
$$p(X_1, X_2, X_3) = \frac{1}{1+\exp(-(-1.20 + 0.33 X_1 - 1.04 X_2 + 0.48 X_3))}$$
c( exp(0.33), exp(-1.04), exp(0.48) )
If the book was challenged for inappropriate language, the odds of it being removed increase by almost 40%. If the book was challenged for anti-family content, the odds of it being removed decrease by almost 65%. Finally, if the book was challenged for violent content, the odds of it being removed increase by about 62%. In the global model, anti-family content thus appears to make a challenged book less likely to be removed.
All effects appear to be significant at the 80% level, since none of the 80% credible intervals contain zero.
exp(0.33) * exp(0.48)
For a book challenged for inappropriate language and violent content but not anti-family content, the odds of being removed increase by a factor of more than 2.
I pick a threshold of 0.5; however, one should check all thresholds and rather argue in terms of AUC.
set.seed(84735)
csummary <- classification_summary(data = book_banning, model = book_model_1, cutoff = 0.5)
csummary
| y | 0 | 1 |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 0 | 700 | 14 |
| 1 | 181 | 36 |
| <dbl> | |
|---|---|
| sensitivity | 0.1658986 |
| specificity | 0.9803922 |
| overall_accuracy | 0.7905478 |
At a threshold of 0.5, the accuracy of the model is almost 80%, but with extremely low sensitivity (sensitivity could be traded off against specificity by tuning the threshold). The accuracy looks impressive, but it is largely due to the class imbalance in the data:
table( book_banning$removed ) / nrow(book_banning)
0 1
0.7669173 0.2330827
Almost 80% of the books are not banned anyway, so a model that predicts "not removed" all the time would perform very similarly to our model.
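As a quick check, the accuracy of such a trivial classifier is simply the proportion of non-removed books:

```r
# Accuracy of a trivial classifier that always predicts "not removed":
# equals the share of non-removed books, roughly 77%
mean( book_banning$removed == 0 )
```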
predictions <- as.numeric( colMeans( posterior_predict( book_model_1, book_banning, type="response" ) ) )
roc_object <- roc( book_banning$removed, predictions )
auc(roc_object)
Setting levels: control = 0, case = 1 Setting direction: controls < cases
options(repr.plot.width=10, repr.plot.height=10)
plot(roc_object)
One could dive further into the class imbalance problem here.
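For instance, pROC can suggest a threshold balancing sensitivity and specificity (a sketch; `best.method="youden"` picks the threshold maximizing sensitivity + specificity):

```r
# Threshold maximizing Youden's J (sensitivity + specificity - 1),
# together with the sensitivity and specificity achieved there
coords( roc_object, "best", best.method = "youden",
        ret = c("threshold", "sensitivity", "specificity") )
```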
tidy(book_model_1, effects = "ran_vals", conf.int = TRUE, conf.level = 0.80) %>%
filter( level %in% c("KS", "OR") )
| level | group | term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| KS | state | (Intercept) | 1.530648 | 0.5388864 | 0.864348 | 2.253114 |
| OR | state | (Intercept) | -1.735703 | 0.4055810 | -2.290942 | -1.242795 |
c( exp(1.53), exp(-1.73) )
Relative to the global baseline, the odds of a challenged book being removed are about 4.6 times higher in Kansas and about 5.6 times lower in Oregon ($\exp(-1.73) \approx 0.18$). This makes sense when looking at the data:
book_banning %>%
filter( state %in% c("KS", "OR") ) %>%
group_by( state ) %>%
summarize( removal_rate = mean( removed ) )
| state | removal_rate |
|---|---|
| <chr> | <dbl> |
| KS | 0.76923077 |
| OR | 0.04237288 |
predictions <- posterior_predict( book_model_1, newdata=data.frame(
state=c("KS", "OR"),
language=c(1,1),
antifamily=c(0,0),
violent=c(0,0))
)
colMeans( predictions )
66% of the posterior predictive draws predict that the book will be removed in Kansas, and 7.1% that it will be removed in Oregon.
colnames( basketball )
dim( basketball )
basketball %>% count( year )
| year | n |
|---|---|
| <int> | <int> |
| 2019 | 146 |
Each row represents the statistics of a player in 2019.
Some player names appear in multiple teams:
basketball %>% count( player_name, team ) %>% count( player_name ) %>% filter( n>1 )
| player_name | n |
|---|---|
| <chr> | <int> |
| Alaina Coates | 3 |
| Asia Taylor | 3 |
| Bridget Carleton | 3 |
| Karlie Samuelson | 3 |
| Kristine Anigwe | 3 |
| Theresa Plaisance | 3 |
These appear to be players who changed teams within the year. It is not quite clear how to treat these exceptions, but they are few in number, so we neglect them.
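If one did want to handle them, a simple (hypothetical) option would be to collapse each player's rows, summing the count-type statistics:

```r
# One possible treatment (a sketch): collapse multi-team players into a
# single row by summing their count-type statistics
basketball %>%
    group_by( player_name ) %>%
    summarize( total_minutes = sum(total_minutes),
               games_played = sum(games_played) )
```

Averages such as avg_points would additionally need to be re-weighted by minutes played, which is another reason to simply neglect the duplicates here.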
basketball %>% summarize(
n_player_statistics=n(),
n_unique_players=n_distinct(player_name),
n_teams=n_distinct(team)
)
| n_player_statistics | n_unique_players | n_teams |
|---|---|---|
| <int> | <int> | <int> |
| 146 | 134 | 13 |
13, see a)
options(repr.plot.width=15, repr.plot.height=5)
ggplot( basketball, aes(x=avg_points, y=total_minutes, color=starter)) + geom_point() + geom_smooth( method="lm" )
`geom_smooth()` using formula = 'y ~ x'
Players who score more points on average are also more often starters. Players with higher average points also play more minutes over the season. This makes sense: better players are probably kept in the game longer. The relationship between total minutes and average points scored is roughly linear, with very similar slopes for starters and non-starters.
ggplot( basketball, aes(x=games_played, y=total_minutes, color=starter)) + geom_point() + geom_smooth( method="lm" )
`geom_smooth()` using formula = 'y ~ x'
The more games players played, the more total minutes they accumulated. Starters have more total minutes for the same number of games. The relationship between total minutes and games played is roughly linear. The fitted blue model for starters appears not very trustworthy at the lower end; assuming the same slope for starters and non-starters is probably reasonable.
Here $Y_{ij}$ is the total minutes played by player $i$ of team $j$, $X_{ij1}$ stands for avg_points, $X_{ij2}$ is an indicator variable for starter and $X_{ij3}$ stands for games_played.
total_minutes is a non-negative number; of all the distributions we have seen in this book, only the Poisson and the negative binomial distributions are suited to modeling outcomes restricted to non-negative values.
ggplot( basketball, aes(x=total_minutes, fill=team) ) + geom_density( alpha=0.2 )
It is hard to recognize a Poisson distribution here; only a posterior predictive check can tell whether a Poisson regression fits or whether we should use a negative binomial regression (if expectation and variance are not the same) - the only models we know in this book for dealing with non-negative counts.
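A quick heuristic overdispersion check (a sketch) compares the sample mean and variance of the outcome; for a Poisson-distributed outcome they should be roughly equal:

```r
# Overdispersion check: for a Poisson outcome, mean and variance
# should be approximately equal
basketball %>%
    summarize( mean_minutes = mean(total_minutes),
               var_minutes = var(total_minutes) )
```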
The distribution of total_minutes varies quite a bit per team:
ggplot( basketball, aes(x=team, y=total_minutes) ) + geom_boxplot()
The number of team members is not very large for a hierarchical regression models with group-specific intercepts and three different slopes:
ggplot( basketball, aes(x=team) ) + geom_bar()
set.seed(84735)
basketball_model_poisson <- stan_glmer(
total_minutes ~ games_played + starter + avg_points + (1 | team),
data = basketball, family = poisson,
prior_intercept = normal(0, 2.5, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
chains = 4, iter = 5000*2, seed = 84735
)
Diagnostics:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(basketball_model_poisson, size = 0.1)
mcmc_dens_overlay(basketball_model_poisson)
mcmc_acf(basketball_model_poisson)
neff_ratio(basketball_model_poisson)
rhat(basketball_model_poisson)
Looks quite ok at first sight!
Posterior predictive check:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(basketball_model_poisson) + xlab("total minutes")
The hierarchical Poisson model is quite close to reality, but interestingly it is much too confident! Might this be because of the assumption that variance equals expectation?
Negative binomial model:
set.seed(84735)
basketball_model_negbinom <- stan_glmer(
total_minutes ~ games_played + starter + avg_points + (1 | team),
data = basketball, family = neg_binomial_2,
prior_intercept = normal(0, 2.5, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
chains = 4, iter = 5000*2, seed = 84735
)
Diagnostics:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(basketball_model_negbinom, size = 0.1)
mcmc_dens_overlay(basketball_model_negbinom)
mcmc_acf(basketball_model_negbinom)
neff_ratio(basketball_model_negbinom)
rhat(basketball_model_negbinom)
Looks reasonably converged.
Normal model:
set.seed(84735)
basketball_model_normal <- stan_glmer(
total_minutes ~ games_played + starter + avg_points + (1 | team),
data = basketball, family = gaussian,
prior_intercept = normal(0, 2.5, autoscale = TRUE),
prior = normal(0, 2.5, autoscale = TRUE),
prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
chains = 4, iter = 5000*2, seed = 84735
)
Diagnostics:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(basketball_model_normal, size = 0.1)
mcmc_dens_overlay(basketball_model_normal)
mcmc_acf(basketball_model_normal)
neff_ratio(basketball_model_normal)
rhat(basketball_model_normal)
Looks reasonably converged.
options(repr.plot.width=15, repr.plot.height=5)
pp_check(basketball_model_negbinom) + xlab("total minutes") + xlim(0, 1650)
Warning message: “Removed 138 rows containing non-finite values (`stat_density()`).”
The negative binomial model is more uncertain than the Poisson model and better captures the data.
pp_check(basketball_model_normal) + xlab("total minutes")
The normal model predicts a substantial amount of negative total minutes. For this reason I would not choose it to model this data, even if it performed the best.
set.seed(84735)
prediction_summary(model = basketball_model_poisson, data = basketball)
| mae | mae_scaled | within_50 | within_95 |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> |
| 38.57865 | 2.548435 | 0.1575342 | 0.3767123 |
set.seed(84735)
prediction_summary(model = basketball_model_negbinom, data = basketball)
| mae | mae_scaled | within_50 | within_95 |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> |
| 37.07328 | 0.4544823 | 0.6917808 | 0.9520548 |
set.seed(84735)
prediction_summary(model = basketball_model_normal, data = basketball)
| mae | mae_scaled | within_50 | within_95 |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> |
| 62.4532 | 0.7035239 | 0.4726027 | 0.9726027 |
The Poisson model has a median absolute error similar to the negative binomial model's, but shows a much larger scaled median absolute error and captures only 16% of all data points within its 50% credible intervals and only 38% within its 95% credible intervals. This model should not be used, as it substantially underestimates uncertainty. The negative binomial model performs much better here, with more reasonable values for mae_scaled, within_50 and within_95. Interestingly, the normal model performs considerably worse in terms of median absolute error, but better than the Poisson model on the other measures.
Differences in terms of ELPD:
set.seed(84735)
elpd1 <- loo(basketball_model_poisson, k_threshold=0.7)
elpd2 <- loo(basketball_model_negbinom, k_threshold=0.7)
elpd3 <- loo(basketball_model_normal, k_threshold=0.7)
loo_compare(elpd1, elpd2, elpd3)
18 problematic observation(s) found. Model will be refit 18 times. Fitting model 1 out of 18 (leaving out observation 13) Fitting model 2 out of 18 (leaving out observation 16) Fitting model 3 out of 18 (leaving out observation 31) Fitting model 4 out of 18 (leaving out observation 39) Fitting model 5 out of 18 (leaving out observation 43) Fitting model 6 out of 18 (leaving out observation 63) Fitting model 7 out of 18 (leaving out observation 68) Fitting model 8 out of 18 (leaving out observation 70) Fitting model 9 out of 18 (leaving out observation 73) Fitting model 10 out of 18 (leaving out observation 81) Fitting model 11 out of 18 (leaving out observation 83) Fitting model 12 out of 18 (leaving out observation 84) Fitting model 13 out of 18 (leaving out observation 85) Fitting model 14 out of 18 (leaving out observation 90) Fitting model 15 out of 18 (leaving out observation 92) Fitting model 16 out of 18 (leaving out observation 105) Fitting model 17 out of 18 (leaving out observation 119) Fitting model 18 out of 18 (leaving out observation 126) All pareto_k estimates below user-specified threshold of 0.7. Returning loo object. All pareto_k estimates below user-specified threshold of 0.7. Returning loo object.
| elpd_diff | se_diff | elpd_loo | se_elpd_loo | p_loo | se_p_loo | looic | se_looic | |
|---|---|---|---|---|---|---|---|---|
| basketball_model_normal | 0.0000000 | 0.00000 | -863.3841 | 8.310301 | 7.005304 | 0.9653374 | 1726.768 | 16.62060 |
| basketball_model_negbinom | -0.5856422 | 17.37618 | -863.9698 | 16.436069 | 10.603503 | 2.2909870 | 1727.940 | 32.87214 |
| basketball_model_poisson | -1051.8485090 | 245.43060 | -1915.2326 | 245.739646 | 351.972863 | 144.0119836 | 3830.465 | 491.47929 |
Interestingly, the normal model performs better than the negative binomial model by a tiny, non-significant margin in terms of ELPD. The Poisson model is far worse, probably because it is overconfident. It would be interesting to investigate this further. Let us choose the negative binomial model for further examination.
As discussed above, choose the negative binomial model.
Global model:
tidy(basketball_model_negbinom, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | 3.22590511 | 0.080575735 | 3.12014170 | 3.3272140 |
| games_played | 0.07784400 | 0.003517451 | 0.07338436 | 0.0822994 |
| starterTRUE | 0.17852032 | 0.077207482 | 0.08037786 | 0.2773162 |
| avg_points | 0.09689437 | 0.011999255 | 0.08144796 | 0.1126978 |
All fixed effects appear to be significant at the 80% level.
$$\log(\mu) = 3.23 + 0.078 X_1 + 0.18 X_2 + 0.097 X_3,$$
where $X_1$ denotes the number of games played, $X_2$ is an indicator for whether the player was used as a starter and $X_3$ denotes the player's average number of points scored.
c(exp(3.12), exp(3.32), exp(0.073), exp(0.082), exp(0.080), exp(0.277), exp(0.081), exp(0.113))
The baseline is 23-28 minutes played in the season. This baseline increases by about 8% per game played, by 8-32% if the player was used as a starter, and by 8-12% with each additional point in the player's scoring average.
Intercepts:
tidy(basketball_model_negbinom, effects = "ran_vals", conf.int = TRUE, conf.level = 0.80) %>%
arrange( desc(estimate) ) %>%
mutate( baseline_difference=exp(estimate) )
| level | group | term | estimate | std.error | conf.low | conf.high | baseline_difference |
|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| NYL | team | (Intercept) | 0.0726261578 | 0.08199996 | -0.007239109 | 0.190994750 | 1.0753285 |
| PHO | team | (Intercept) | 0.0685500499 | 0.07984687 | -0.008274054 | 0.187975585 | 1.0709542 |
| DAL | team | (Intercept) | 0.0497479001 | 0.07031646 | -0.019710644 | 0.157827379 | 1.0510061 |
| LAS | team | (Intercept) | 0.0478304912 | 0.07044620 | -0.023998892 | 0.158986958 | 1.0489928 |
| TOT | team | (Intercept) | 0.0169588446 | 0.06664789 | -0.064692201 | 0.131703285 | 1.0171035 |
| MIN | team | (Intercept) | 0.0003715544 | 0.05849000 | -0.081561008 | 0.086171826 | 1.0003716 |
| IND | team | (Intercept) | -0.0049049449 | 0.05842614 | -0.096176292 | 0.077544670 | 0.9951071 |
| SEA | team | (Intercept) | -0.0171817606 | 0.06232905 | -0.118300787 | 0.060309130 | 0.9829650 |
| CON | team | (Intercept) | -0.0213308828 | 0.05982796 | -0.118854380 | 0.051066556 | 0.9788950 |
| ATL | team | (Intercept) | -0.0243505583 | 0.06104048 | -0.121613248 | 0.049141909 | 0.9759435 |
| CHI | team | (Intercept) | -0.0459061390 | 0.07046683 | -0.158924103 | 0.028434013 | 0.9551316 |
| WAS | team | (Intercept) | -0.0534048697 | 0.07664904 | -0.171591811 | 0.022742621 | 0.9479961 |
| LVA | team | (Intercept) | -0.0766880362 | 0.09071436 | -0.211633056 | 0.008359801 | 0.9261787 |
The NYL players have the largest total playing time on average, about 7.5% above the global average, and the LVA players have the shortest total playing time on average, about 7% below the global average.
The remaining parts are left up to you; the concepts should be clear by now.